We introduce two-level discounted games played by two players on a perfect-information stochastic
game graph. The upper level game is a discounted game and the lower level game is an undiscounted 
reachability game. Two-level games model hierarchical and sequential decision making under uncertainty
across different time scales. We show the existence of pure memoryless optimal strategies for
both players and an ordered field property for such games. We show that if there is only one player 
(Markov decision processes), then the values can be computed in polynomial time. It follows that 
whether the value of a player is equal to a given rational constant in two-level discounted games can
be decided in NP $\cap$ coNP. We also give an alternate strategy improvement algorithm to compute the
value. 

1 Introduction 

Discrete stochastic games have been extensively studied as models for decision making under adversarial 
interactions in an uncertain environment, and have found many applications, such as in manufacturing 
systems control and inventory management [4].

In many such applications, the interaction with the environment occurs in a hierarchical manner, 
intercalated across different time scales. In the short-term, a decision has to be made about choosing 
one of several possible actions. For example, short term decisions can determine whether to buy a 
certain product or another, or whether to increase or decrease production capacity. In the long-term, the 
system gets a profit or a loss at each step based on its existing inventory. Both short-term and long-term 
decisions can potentially involve uncertainty and adversarial interactions. Moreover, long term decisions 
are influenced by the short term actions chosen, and model the effect of the local decisions on the overall 
profits or losses of the system. 

Technically, the two types of interaction are modeled using two distinct classes of games. Undiscounted
reachability games are used to model short-term decision making (e.g., what action to take
next). In a reachability game on a state space, one fixes a set of goal states, and the objective of player 1 
is to maximize the probability of reaching the goal states. 

On the other hand, discounted reward games model long-term rewards for the system (e.g., how the 
actions chosen locally relate to long-term profits). In a discounted game, player 1 gets a reward in each 
step, and a time discount parameter $\lambda \in (0, 1)$ is used to "discount" the reward at future time points (i.e.,
the reward $r$ obtained $t$ time units in the future is given a value $\lambda^t r$). The short-term interactions are
abstracted away into an atomic step that uniformly sets the time granularity. The objective of player 1 is 
to maximize the expected normalized sum of discounted rewards. 

To make the games concrete, consider economic policymakers setting financial policy. The specific 
policy implemented (e.g., the interest rate or the amount of regulation) affects the long-term health of the
economy, and the interplay between financial policy and the market can be modeled using discounted
rewards. However, in each step, the specific policy chosen depends on "short-term" games between 
various stakeholders, such as politicians, the treasury, companies, and various interest groups. In this 
setting, the time granularity of long-term steps (policy implementation) is variable, and depends on the 
length and outcome of the short-term steps (deciding which policy to implement). 

While each game model in itself provides a sound theoretical basis for reasoning about system behavior,
the hierarchical interaction and varying time granularities (short-term vs. long-term) present in
many applications are not adequately captured by either model. In this paper, we introduce models for
such multi-level interactions and algorithms for sequential decision making in a setting where the time 
granularity can be variable. We define a two-level discounted game, in which a "lower level" reachability 
game is used to decide actions for a "higher level" discounted game. The discount factor is applied to the 
time scale of the higher-level game, not to every step that elapses in the lower-level game. Since the
lower-level games may all differ, the time granularity can vary independently from one transition to the next.

Our main result is the existence of value and pure memoryless strategies in two-level discounted 
games. Moreover, we show that the value and optimal strategies can be computed in polynomial time 
for Markov decision processes, and the complexity of checking if the value is equal to a rational is in
NP $\cap$ coNP for $2\frac{1}{2}$-player two-level discounted games. Two-level discounted games subsume classical
discounted games, and our complexity bounds match the best known results for classical discounted 
games. 

Technically, we combine the existence of pure memoryless strategies in discounted games [4] with
the existence of pure memoryless strategies in (undiscounted) $2\frac{1}{2}$-player reachability games [1, 4, 6],
together with a reduction from two-level games to a one-level discounted game. In particular, we show that
for Markov decision processes, we can formulate the value at a state as a linear programming problem 
over the states of the two-level game. Thus, the games have an ordered field property: if all constants 
in the definition of the game come from a field F, then the value is also in F; in particular, games with 
rational probabilities and rational discount factors have a rational value. Together with the existence of 
pure memoryless strategies, this implies that the decision problem to check if the value is equal to a 
given rational is in NP $\cap$ coNP. We also give a strategy improvement algorithm to compute the value,
by combining strategy improvement algorithms for stochastic reachability [2] and discounted games [4].

Thus our new model of stochastic games provides a uniform framework for decision making across 
different time scales, and our algorithms show how to decide optimally in such a framework. 

2 Definitions 

We consider several classes of turn-based games: two-player turn-based probabilistic games ($2\frac{1}{2}$-player
games), two-player turn-based deterministic games (2-player games), and Markov decision processes
($1\frac{1}{2}$-player games).

Notation. For a finite set $A$, a probability distribution on $A$ is a function $\delta : A \to [0, 1]$ such that
$\sum_{a \in A} \delta(a) = 1$. We denote the set of probability distributions on $A$ by $\mathcal{D}(A)$. Given a distribution
$\delta \in \mathcal{D}(A)$, we denote by $\mathrm{Supp}(\delta) = \{x \in A \mid \delta(x) > 0\}$ the support of $\delta$.

Game graphs. A turn-based probabilistic game graph ($2\frac{1}{2}$-player game graph)
$G = ((S, E), (S_1, S_2, S_P), \delta)$ consists of a directed graph $(S, E)$, a partition $(S_1, S_2, S_P)$ of the finite set $S$ of
states, and a probabilistic transition function $\delta : S_P \to \mathcal{D}(S)$, where $\mathcal{D}(S)$ denotes the set of probability
distributions over the state space $S$. The states in $S_1$ are the player-1 states, where player 1 decides the
successor state; the states in $S_2$ are the player-2 states, where player 2 decides the successor state; and the
states in $S_P$ are the probabilistic states, where the successor state is chosen according to the probabilistic
transition function $\delta$. We assume that for $s \in S_P$ and $t \in S$, we have $(s, t) \in E$ iff $\delta(s)(t) > 0$, and we
often write $\delta(s, t)$ for $\delta(s)(t)$. For technical convenience we assume that every state in the graph $(S, E)$
has at least one outgoing edge. For a state $s \in S$, we write $E(s)$ to denote the set $\{t \in S \mid (s, t) \in E\}$ of
possible successors. The size of a game graph $G = ((S, E), (S_1, S_2, S_P), \delta)$ is

$$|G| = |S| + |E| + \sum_{t \in S} \sum_{s \in S_P} |\delta(s)(t)|,$$

where $|\delta(s)(t)|$ denotes the space to represent the transition probability $\delta(s)(t)$ in binary.

The turn-based deterministic game graphs (2-player game graphs) are the special case of the
$2\frac{1}{2}$-player game graphs with $S_P = \emptyset$. The Markov decision processes ($1\frac{1}{2}$-player game graphs) are the
special case of the $2\frac{1}{2}$-player game graphs with $S_1 = \emptyset$ or $S_2 = \emptyset$. We refer to the MDPs with $S_2 = \emptyset$ as
player-1 MDPs, and to the MDPs with $S_1 = \emptyset$ as player-2 MDPs.
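As an illustration of the definition above, the following minimal sketch stores a game graph and checks the stated assumptions (disjoint state classes, at least one outgoing edge per state, and edges of probabilistic states matching the support of $\delta$). The class name `GameGraph`, its field names, and the method `kind` are hypothetical names chosen for this sketch; they are not part of the formal development.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class GameGraph:
    """Sketch of a game graph ((S, E), (S1, S2, SP), delta)."""
    s1: set      # player-1 states
    s2: set      # player-2 states
    sp: set      # probabilistic states
    edges: set   # set of (s, t) pairs
    delta: dict  # delta[s][t] = transition probability, for s in sp

    @property
    def states(self):
        return self.s1 | self.s2 | self.sp

    def successors(self, s):
        """E(s): the set of possible successors of s."""
        return {t for (u, t) in self.edges if u == s}

    def check(self):
        """Raise ValueError if an assumption from the text is violated."""
        if self.s1 & self.s2 or self.s1 & self.sp or self.s2 & self.sp:
            raise ValueError("S1, S2, SP must be pairwise disjoint")
        for s in self.states:
            if not self.successors(s):
                raise ValueError(f"state {s!r} has no outgoing edge")
        for s in self.sp:
            dist = self.delta[s]
            if sum(dist.values()) != 1:
                raise ValueError(f"delta({s!r}) does not sum to 1")
            support = {t for t, p in dist.items() if p > 0}
            if support != self.successors(s):
                raise ValueError(f"edges of {s!r} differ from Supp(delta)")

    def kind(self):
        """Classify the graph by which state classes are empty."""
        if not self.sp:
            return "2-player"
        if not self.s1 or not self.s2:
            return "1.5-player (MDP)"
        return "2.5-player"


# A small example graph; exact rationals keep the sum-to-one test exact.
g = GameGraph(
    s1={"a"}, s2={"b"}, sp={"c"},
    edges={("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("c", "b")},
    delta={"c": {"a": Fraction(1, 3), "b": Fraction(2, 3)}},
)
g.check()
```

Using `Fraction` for the probabilities is a design choice of the sketch: it avoids rounding when testing that a distribution sums to 1.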

Plays and strategies. An infinite path, or play, of the game graph $G$ is an infinite sequence
$\omega = \langle s_0, s_1, s_2, \ldots \rangle$ of states such that $(s_k, s_{k+1}) \in E$ for all $k \in \mathbb{N}$. We write $\Omega$ for the set of all plays, and
for a state $s \in S$, we write $\Omega_s \subseteq \Omega$ for the set of plays that start from the state $s$.

A strategy for player 1 is a function $\sigma : S^* \cdot S_1 \to \mathcal{D}(S)$ that assigns a probability distribution to all
finite sequences $w \in S^* \cdot S_1$ of states ending in a player-1 state (the sequence represents a prefix of a
play). Player 1 follows the strategy $\sigma$ if in each player-1 move, given that the current history of the
game is $w \in S^* \cdot S_1$, she chooses the next state according to the probability distribution $\sigma(w)$. A strategy
must prescribe only available moves, i.e., for all $w \in S^*$ and $s \in S_1$ we have $\mathrm{Supp}(\sigma(w \cdot s)) \subseteq E(s)$. The
strategies for player 2 are defined analogously. We denote by $\Sigma$ and $\Pi$ the set of all strategies for player 1
and player 2, respectively.

Once a starting state $s \in S$ and strategies $\sigma \in \Sigma$ and $\pi \in \Pi$ for the two players are fixed, the outcome
of the game is a random walk $\omega_s^{\sigma,\pi}$ for which the probabilities of events are uniquely defined, where an
event $\mathcal{A} \subseteq \Omega$ is a measurable set of paths. For a state $s \in S$ and an event $\mathcal{A} \subseteq \Omega$, we write $\Pr_s^{\sigma,\pi}(\mathcal{A})$
for the probability that a path belongs to $\mathcal{A}$ if the game starts from the state $s$ and the players follow the
strategies $\sigma$ and $\pi$, respectively. Similarly, we denote by $\mathbb{E}_s^{\sigma,\pi}[\cdot]$ the expectation under the probability
measure $\Pr_s^{\sigma,\pi}(\cdot)$. In the context of player-1 MDPs we often omit the argument $\pi$, because $\Pi$ is a
singleton set.

We classify strategies according to their use of randomization and memory. Strategies that do not use
randomization are called pure; formally, a player-1 strategy $\sigma$ is pure if for all $w \in S^*$ and $s \in S_1$, there
is a state $t \in S$ such that $\sigma(w \cdot s)(t) = 1$. We denote by $\Sigma^P \subseteq \Sigma$ the set of pure strategies for player 1. In
order to emphasize the potential use of randomization, we call a (general) strategy randomized. Let $M$
be a set called memory, that is, $M$ is a set of memory elements. A player-1 strategy $\sigma$ can be described
as a pair of functions $\sigma = (\sigma_u, \sigma_m)$: a memory-update function $\sigma_u : S \times M \to M$ and a next-move function
$\sigma_m : S_1 \times M \to \mathcal{D}(S)$. We can think of strategies with memory as input/output automata computing the
strategies (see [3] for details). A strategy $\sigma = (\sigma_u, \sigma_m)$ is finite-memory if the memory $M$ is finite, and then
the size of the strategy $\sigma$, denoted as $|\sigma|$, is the size of its memory $M$, i.e., $|\sigma| = |M|$. We denote by $\Sigma^F$ the
set of finite-memory strategies for player 1, and by $\Sigma^{PF}$ the set of pure finite-memory strategies; that is,
$\Sigma^{PF} = \Sigma^P \cap \Sigma^F$. The strategy $(\sigma_u, \sigma_m)$ is memoryless if $|M| = 1$; that is, the next move does not depend on
the history of the play but only on the current state. A memoryless player-1 strategy can be represented
as a function $\sigma : S_1 \to \mathcal{D}(S)$. A pure memoryless strategy is a pure strategy that is memoryless. A pure
memoryless strategy for player 1 can be represented as a function $\sigma : S_1 \to S$. We denote by $\Sigma^M$ the
set of memoryless strategies for player 1, and by $\Sigma^{PM}$ the set of pure memoryless strategies; that is,
$\Sigma^{PM} = \Sigma^P \cap \Sigma^M$. Analogously we define the corresponding strategy families $\Pi^P$, $\Pi^F$, $\Pi^{PF}$, $\Pi^M$, and $\Pi^{PM}$
for player 2.
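To make the role of pure memoryless strategies concrete, the sketch below fixes one such strategy per player (each a plain map from states to successors) and builds the Markov chain that results. The function names `induced_chain` and `path_probability`, and the three-state example, are assumptions made for illustration only.

```python
def induced_chain(s1, s2, delta, sigma, pi):
    """Return chain[s][t]: the one-step probability once both players
    follow the pure memoryless strategies sigma (player 1) and pi (player 2)."""
    chain = {}
    for s in s1:
        chain[s] = {sigma[s]: 1.0}   # player 1 moves deterministically
    for s in s2:
        chain[s] = {pi[s]: 1.0}      # player 2 moves deterministically
    for s, dist in delta.items():
        chain[s] = dict(dist)        # probabilistic states keep delta
    return chain


def path_probability(chain, path):
    """Probability that a play extends the given finite path,
    starting from the first state of the path."""
    prob = 1.0
    for s, t in zip(path, path[1:]):
        prob *= chain[s].get(t, 0.0)
    return prob


# Hypothetical example: a is a player-1 state, b a player-2 state,
# and c a probabilistic state.
chain = induced_chain(
    s1={"a"}, s2={"b"},
    delta={"c": {"a": 0.25, "b": 0.75}},
    sigma={"a": "c"}, pi={"b": "a"},
)
p = path_probability(chain, ["a", "c", "b", "a"])
```

Because the strategies are memoryless, the chain has exactly the states of the game graph; a strategy with memory $M$ would instead induce a chain over pairs of a state and a memory element.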

Two-level discounted games. A two-level discounted game consists of a turn-based probabilistic game
graph $G$; a partition of the state space $S$ into $(S_u, S_l)$, the set $S_u$ of upper level states and the set $S_l$ of
lower level states; and a reward function $r : S_u \to \mathbb{R}_{>0}$ that maps every upper level state to a positive
real-valued reward. We also require that from every state $s \in S_l$ player 1 can ensure that a state in $S_u$ is reached
with probability 1. In other words, for all $s \in S_l$, there exists a player-1 strategy $\sigma$ such that against all
player-2 strategies $\pi$ we have $\Pr_s^{\sigma,\pi}(\mathrm{Reach}(S_u)) = 1$, where $\mathrm{Reach}(S_u)$ is the set of paths that visit a state
in $S_u$.

Discounted objectives. An objective $f$ is a measurable function $f : \Omega \to \mathbb{R}$ that assigns to every path
a real-valued payoff. The discounted objective in two-level discounted games is a measurable function
$\mathrm{TwoDisc} : \Omega \to \mathbb{R}$ defined as follows: for $0 < \beta < 1$, consider a path $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ and for an index
$i \geq 0$, let

$$a(i) = \begin{cases}
0 & \text{if } s_i \in S_l; \\
\beta^k \cdot r(s_i) & \text{if } s_i \in S_u \text{ and the number of } S_u \text{ states in } \langle s_0, \ldots, s_{i-1} \rangle \text{ is } k - 1;
\end{cases}$$

then

$$\mathrm{TwoDisc}(\omega) = (1 - \beta) \cdot \sum_{i=0}^{\infty} a(i).$$

In other words, the payoff of a path is the normalized discounted sum of the rewards of the path and the
discounting is applied for every upper level state.
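The following sketch evaluates the contribution of a finite path prefix to the TwoDisc payoff. It is illustrative only: the function name `two_disc_prefix` and the parameter `first_exponent` are assumptions of this sketch. The parameter fixes the exponent of $\beta$ applied to the first upper level state on the path; the default of 1 is one reading of the definition, and 0 gives the convention in which a constant reward $r$ yields payoff exactly $r$.

```python
def two_disc_prefix(path, upper, reward, beta, first_exponent=1):
    """Contribution of a finite prefix to (1 - beta) * sum_i a(i).

    Lower level states contribute 0 and do not advance the exponent;
    each upper level state contributes beta**k * reward and then
    increments k, so discounting applies per upper level state only.
    """
    total = 0.0
    k = first_exponent
    for s in path:
        if s in upper:
            total += beta ** k * reward[s]
            k += 1
    return (1 - beta) * total


# Hypothetical prefix: l is a lower level state, u and v are upper level.
prefix = ["l", "u", "l", "l", "v"]
payoff = two_disc_prefix(prefix, upper={"u", "v"},
                         reward={"u": 2.0, "v": 4.0}, beta=0.5)
```

In the example, the three lower level steps leave the exponent unchanged, so `v` is discounted only once more than `u` even though it occurs three steps later.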

Optimal strategies. Given objectives $f$ and $-f$ for player 1 and player 2, respectively, we define the
value functions $\langle\langle 1 \rangle\rangle_{val}$ and $\langle\langle 2 \rangle\rangle_{val}$ for the players 1 and 2, respectively, as the following functions from
the state space $S$ to the set $\mathbb{R}$ of reals: for all states $s \in S$, let

$$\langle\langle 1 \rangle\rangle_{val}(f)(s) = \sup_{\sigma \in \Sigma} \inf_{\pi \in \Pi} \mathbb{E}_s^{\sigma,\pi}[f]; \qquad \langle\langle 2 \rangle\rangle_{val}(-f)(s) = \sup_{\pi \in \Pi} \inf_{\sigma \in \Sigma} \mathbb{E}_s^{\sigma,\pi}[-f].$$

In other words, the value $\langle\langle 1 \rangle\rangle_{val}(f)(s)$ gives the maximal expectation with which player 1 can achieve
her objective $f$ from state $s$, and analogously for player 2. The strategies that achieve the value are
called optimal: a strategy $\sigma$ for player 1 is optimal from the state $s$ for the objective $f$ if
$\langle\langle 1 \rangle\rangle_{val}(f)(s) = \inf_{\pi \in \Pi} \mathbb{E}_s^{\sigma,\pi}[f]$. The optimal strategies for player 2 are defined analogously. We now state the classical
determinacy results for $2\frac{1}{2}$-player games with measurable objectives.

Theorem 1 (Quantitative determinacy [6]) For all $2\frac{1}{2}$-player game graphs
$G = ((S, E), (S_1, S_2, S_P), \delta)$ and for all measurable functions $f$, we have $\langle\langle 1 \rangle\rangle_{val}(f)(s) + \langle\langle 2 \rangle\rangle_{val}(-f)(s) = 0$
for all states $s \in S$.

The determinacy result follows for two-level discounted games. In the following section we will
study the complexity of optimal strategies and the computational complexity of solving two-level
discounted games. We first recall a result about the classical discounted games. The classical discounted
games are special cases of two-level discounted games such that $S_l = \emptyset$; i.e., the game consists of only
upper level states. We refer to this class of games as one-level discounted games.

Theorem 2 (Memoryless determinacy of one-level discounted games [4]) For all $2\frac{1}{2}$-player
one-level discounted games, pure memoryless optimal strategies exist for both players.

3 Strategy and Computational Complexity

We first show that pure memoryless optimal strategies exist in two-level discounted games. To this end,
we present a special class of two-level discounted games and reduce it to one-level discounted games.

One-step two-level discounted games. The class of one-step two-level discounted games is the special
case of two-level discounted games in which the following restrictions are satisfied: (a) $S_l \subseteq S_P$ (i.e.,
every lower level state is a probabilistic state); and (b) $E \cap (S_l \times S) \subseteq S_l \times S_u$ (i.e., every successor of a state
in $S_l$ is a state in $S_u$). In other words, in one-step two-level discounted games, from any lower level state
the upper level states are reached in one step with probability 1. A one-step two-level discounted game
can be reduced to a one-level discounted game as follows. We convert every state in $S_l$ to a state in $S_u$,
and the reward function is modified as follows: we add rewards to the states in $S_l$ to take care of the extra
discounting step for converting a state in $S_l$ to a state in $S_u$, i.e., for a state $s \in S_l$ its reward is assigned as

teS 

Since we can reduce one-step two-level discounted games to one-level discounted games, the existence 
of pure memoryless optimal strategies in one-step two-level discounted games follows. 

Theorem 3 Pure memoryless optimal strategies exist for both players in two-level discounted games. 

Proof. To prove the result we will use the existence of pure memoryless optimal strategies in reachability
games, and present a reduction to one-step two-level discounted games.

First, for states in $S_l$ we consider a reachability game as follows: once the game reaches a state
$s \in S_u$, then player 1 receives the payoff $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s)$ and the game stops. The goal of player 1 is
to maximize the payoff. Since this game is a reachability game, from the existence of pure memoryless
optimal strategies in turn-based probabilistic reachability games [1], it follows that pure memoryless
optimal strategies $\sigma^*$ and $\pi^*$ exist for both players in this game. Since the reward function is positive,
it follows that $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) > 0$ for all $s \in S_u$. Since in two-level discounted games player 1 can
ensure to reach $S_u$ with probability 1, it follows that once $\sigma^*$ and $\pi^*$ are fixed, from all states in $S_l$ states
in $S_u$ are reached with probability 1. Let $T$ denote the random time when the game first reaches a state in
$S_u$, and $\Theta_T$ denote the random variable for the $T$-th state. The strategies $\sigma^*$ and $\pi^*$ ensure the following:

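1. for all strategies $\pi$ and for all states $s \in S_l$ we have

$$\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) \leq \sum_{t \in S_u} \Pr_s^{\sigma^*,\pi}(\Theta_T = t) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t);$$

2. for all strategies $\sigma$ and for all states $s \in S_l$ we have

$$\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) \geq \sum_{t \in S_u} \Pr_s^{\sigma,\pi^*}(\Theta_T = t) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t);$$

and

3. $\Pr_s^{\sigma^*,\pi^*}(T < \infty) = 1$ and $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \sum_{t \in S_u} \Pr_s^{\sigma^*,\pi^*}(\Theta_T = t) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t)$.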

We now present a reduction from two-level discounted games to one-step two-level discounted games.
We replace each state $s \in S_l$ by a probabilistic state such that from $s$ the successor state distribution is as
follows: $\delta(s)(t) = \Pr_s^{\sigma^*,\pi^*}(\mathrm{Reach}(t))$ for $t \in S_u$. The following assertions hold in the one-step two-level
discounted game for states in $S_u$:

1. for all states $s \in S_1 \cap S_u$, we have

$$\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \max_{t \in E(s)} \beta \cdot r(s) + (1 - \beta) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t);$$

2. for all states $s \in S_2 \cap S_u$, we have

$$\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \min_{t \in E(s)} \beta \cdot r(s) + (1 - \beta) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t);$$

and

3. for all states $s \in S_P \cap S_u$, we have

$$\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \beta \cdot r(s) + (1 - \beta) \cdot \sum_{t \in S} \delta(s)(t) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t).$$

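The following assertions hold in the one-step two-level discounted game for states in $S_l$:

1. for all states $s \in S_1 \cap S_l$, we have $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \max_{t \in E(s)} \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t)$;

2. for all states $s \in S_2 \cap S_l$, we have $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \min_{t \in E(s)} \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t)$; and

3. for all states $s \in S_P \cap S_l$, we have $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) = \sum_{t \in S} \delta(s)(t) \cdot \langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(t)$.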

From the above inequalities, the classical correctness proof for one-level discounted games, and the
reduction of one-step two-level discounted games to one-level discounted games, it follows that the values
in the original game and the reduced one-step two-level discounted games coincide. The combination
of the pure memoryless optimal strategy in the discounted game obtained after reduction, and the pure
memoryless optimal strategy in the reachability game is a witness optimal strategy in the two-level
discounted game. Hence the result follows. $\blacksquare$
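The fixed-point equations used in the proof (discounting with $\beta \cdot r(s) + (1 - \beta) \cdot (\ldots)$ at upper level states, and no discounting at lower level states) can be explored numerically. The sketch below iterates them from the all-zero vector. It assumes, without proof here, that this iteration converges to the values on the example; the function name `solve_fixpoint`, the encodings of `owner` and `level`, and the three-state game are hypothetical.

```python
def solve_fixpoint(owner, level, succ, delta, reward, beta,
                   sweeps=10_000, tol=1e-12):
    """Iterate the value equations from the all-zero vector.

    owner[s] is 1, 2, or "P"; level[s] is "upper" or "lower".
    succ[s] lists E(s) for player states; delta[s] maps successors
    to probabilities for probabilistic states.
    """
    val = {s: 0.0 for s in owner}
    for _ in range(sweeps):
        new = {}
        for s in owner:
            if owner[s] == "P":
                nxt = sum(p * val[t] for t, p in delta[s].items())
            elif owner[s] == 1:
                nxt = max(val[t] for t in succ[s])
            else:
                nxt = min(val[t] for t in succ[s])
            if level[s] == "upper":
                new[s] = beta * reward[s] + (1 - beta) * nxt
            else:
                new[s] = nxt  # lower level steps are not discounted
        done = max(abs(new[s] - val[s]) for s in owner) < tol
        val = new
        if done:
            break
    return val


# Hypothetical game: a is an upper level player-1 state, b an upper
# level probabilistic state, and l a lower level probabilistic state.
val = solve_fixpoint(
    owner={"a": 1, "b": "P", "l": "P"},
    level={"a": "upper", "b": "upper", "l": "lower"},
    succ={"a": ["l", "b"]},
    delta={"b": {"a": 1.0}, "l": {"a": 0.5, "l": 0.5}},
    reward={"a": 1.0, "b": 3.0},
    beta=0.5,
)
```

Solving the equations by hand for this example gives values 5/3 at `a` and `l` and 7/3 at `b`, with player 1 moving from `a` to `b`. The self-loop at `l` changes how long the lower level game lasts but not the value, since lower level steps carry no discount.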

Solution for two-level discounted MDPs. The existence of pure memoryless optimal strategies for
two-level discounted games is proved by combining the solution of a reachability game and one-level
discounted games. The result for MDPs follows as a special case. The value function for MDPs can be
obtained from the solution of a linear programming problem which combines the linear programming
solution for MDPs with reachability and one-level discounted objectives. The linear program for player-1
MDPs is as follows: the objective function is $\min \sum_{s \in S} x_s$ subject to the following constraints:

$$\begin{array}{ll}
x_s \geq x_t & s \in S_1 \cap S_l,\ (s, t) \in E; \\
x_s \geq \sum_{t \in S} \delta(s)(t) \cdot x_t & s \in S_l \cap S_P; \\
x_s \geq \beta \cdot r(s) + (1 - \beta) \cdot x_t & s \in S_u \cap S_1,\ (s, t) \in E; \\
x_s \geq \beta \cdot r(s) + (1 - \beta) \cdot \sum_{t \in S} \delta(s)(t) \cdot x_t & s \in S_u \cap S_P.
\end{array}$$

The solution for player-2 MDPs is similar. This gives us the following result.

Theorem 4 Given a two-level discounted game on a player-1 MDP or a player-2 MDP, the value at all
states can be computed in polynomial time.

A class of discounted games has the ordered field property if for every game TwoDisc in the class
with rewards, transition probabilities, and discount factors chosen from a field $F$, we have that the value
$\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s)$ is also in $F$ for each state $s$.
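To illustrate the linear program for player-1 MDPs stated before Theorem 4, the sketch below does not solve the program; it only evaluates the slack of every constraint for a candidate vector $x$, so that feasibility and tightness can be inspected. The function name `lp_slacks`, the dictionary encodings, and the example MDP are assumptions of this sketch.

```python
from fractions import Fraction


def lp_slacks(x, owner, level, succ, delta, reward, beta):
    """Return {constraint: lhs - rhs}; x is feasible iff all slacks >= 0.

    Player-1 states yield one constraint per edge (s, t), keyed (s, t);
    probabilistic states yield a single constraint, keyed (s, None).
    """
    slacks = {}
    for s in x:
        upper = level[s] == "upper"
        gain = beta * reward[s] if upper else 0   # beta * r(s) at upper level
        keep = 1 - beta if upper else 1           # (1 - beta) at upper level
        if owner[s] == "P":
            rhs = gain + keep * sum(p * x[t] for t, p in delta[s].items())
            slacks[(s, None)] = x[s] - rhs
        else:
            for t in succ[s]:
                slacks[(s, t)] = x[s] - (gain + keep * x[t])
    return slacks


# Hypothetical player-1 MDP: a (upper, player 1), b (upper, probabilistic),
# l (lower, probabilistic).
half = Fraction(1, 2)
model = dict(
    owner={"a": 1, "b": "P", "l": "P"},
    level={"a": "upper", "b": "upper", "l": "lower"},
    succ={"a": ["l", "b"]},
    delta={"b": {"a": Fraction(1)}, "l": {"a": half, "l": half}},
    reward={"a": 1, "b": 3},
    beta=half,
)
x = {"a": Fraction(5, 3), "b": Fraction(7, 3), "l": Fraction(5, 3)}
slacks = lp_slacks(x, **model)
```

For this candidate every slack is non-negative and the constraint for the edge from `a` to `b` is tight, which is what one expects at the value vector when moving to `b` is optimal. An off-the-shelf LP solver could be used to minimize $\sum_{s \in S} x_s$ over the same constraints.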
Corollary 1 (Ordered field property) Given a two-level discounted game, if the rewards, discount factor,
and transition probabilities are rational, then the value at every state is rational. The class of all
two-level discounted games has the ordered field property.

Proof. The result follows from the existence of pure memoryless optimal strategies and the existence
of a linear program that characterizes the values on MDPs. Once a pure memoryless optimal strategy is
fixed we have an MDP, and by the linear program characterizing the value for MDPs it follows that if the
rewards, discount factor, and transition probabilities are rational, then the value at every state is rational.
The ordered field property follows from similar arguments. $\blacksquare$
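The ordered field property can be observed on small instances by evaluating a fixed pair of pure memoryless strategies in exact rational arithmetic. The sketch below solves the resulting linear equations by Gauss-Jordan elimination over `fractions.Fraction`, so every intermediate quantity stays rational. The function name `exact_values` and the example chain are hypothetical, and the system is assumed to be nonsingular (as it is when upper level states are reached with probability 1).

```python
from fractions import Fraction


def exact_values(chain, upper, reward, beta):
    """Solve v = b + A v exactly, where chain[s][t] is the one-step
    probability after both players' strategies are fixed."""
    states = sorted(chain)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    # Build the augmented matrix [I - A | b] over the rationals.
    rows = []
    for s in states:
        scale = 1 - beta if s in upper else Fraction(1)
        row = [Fraction(0)] * (n + 1)
        row[idx[s]] += 1
        for t, p in chain[s].items():
            row[idx[t]] -= scale * p
        row[n] = beta * reward[s] if s in upper else Fraction(0)
        rows.append(row)
    # Gauss-Jordan elimination using only field operations.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = 1 / rows[col][col]
        rows[col] = [entry * inv for entry in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [e - factor * c for e, c in zip(rows[r], rows[col])]
    return {s: rows[idx[s]][n] for s in states}


# Hypothetical chain: a and b are upper level, l is lower level.
values = exact_values(
    chain={
        "a": {"b": Fraction(1)},
        "b": {"a": Fraction(1)},
        "l": {"a": Fraction(1, 2), "l": Fraction(1, 2)},
    },
    upper={"a", "b"},
    reward={"a": 1, "b": 3},
    beta=Fraction(1, 2),
)
```

Since elimination uses only addition, subtraction, multiplication, and division, the same code works over any exact field type in place of `Fraction`, which mirrors the argument in the proof above.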

Complexity of two-level discounted games. Since pure memoryless optimal strategies exist for both
players in two-level discounted games, and MDPs with two-level discounted objectives can be solved
in polynomial time, it follows that the decision problem for the value function in two-level discounted
games can be solved in NP $\cap$ coNP. Hence we have the following result.

Theorem 5 Given a two-level discounted game, a rational number $q$, a state $s$, and $\bowtie \in \{\geq, >, \leq, <, =\}$,
whether $\langle\langle 1 \rangle\rangle_{val}(\mathrm{TwoDisc})(s) \bowtie q$ can be decided in NP $\cap$ coNP.

Algorithm for computing values. The existence of pure memoryless strategies ensures the correctness
of the following naive algorithm to compute the values in two-level discounted games: (a) enumerate
all pure memoryless strategies, and for each pure memoryless strategy compute the value for the MDP
obtained by fixing the strategy (using the linear program), and (b) choose the value of the best pure
memoryless strategy. The above algorithm is an exhaustive search on the set of pure memoryless
strategies. We now describe an efficient search on the set of pure memoryless strategies given as a strategy
improvement algorithm for two-level discounted games. The strategy improvement algorithm combines
in a hierarchical fashion two classical strategy improvement algorithms: (a) the strategy improvement
algorithm for stochastic games with discounted objectives [4] and (b) the strategy improvement algorithm
for stochastic reachability games [2].

The strategy improvement algorithm is as follows: (a) fix a pure memoryless strategy at the upper-level
states; (b) apply the strategy improvement algorithm for reachability games for the lower-level
reachability game to compute values given the strategy that is fixed in the higher-level game; and (c) once
the values are computed, apply the strategy improvement step for discounted games to improve the
upper-level strategy. The algorithm stops when no improvement is possible and obtains a pure
memoryless optimal strategy. This gives us a strategy improvement algorithm to compute values in two-level
discounted games.
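The three steps above can be sketched in a deliberately simplified setting: a player-1 MDP in which every lower level state is probabilistic, so that step (b) reduces to evaluating a Markov chain rather than running a nested strategy improvement for a reachability game. This is an illustrative simplification, not the full hierarchical algorithm; the function names `evaluate` and `improve`, the tolerance `eps`, and the example are assumptions.

```python
def evaluate(chain, upper, reward, beta, sweeps=10_000, tol=1e-13):
    """Approximate the values of a Markov chain by iterating from zero."""
    val = {s: 0.0 for s in chain}
    for _ in range(sweeps):
        new = {}
        for s, dist in chain.items():
            nxt = sum(p * val[t] for t, p in dist.items())
            new[s] = beta * reward[s] + (1 - beta) * nxt if s in upper else nxt
        if max(abs(new[s] - val[s]) for s in chain) < tol:
            return new
        val = new
    return val


def improve(succ, delta, upper, reward, beta, eps=1e-9):
    """Strategy improvement over the player-1 states listed in succ."""
    # (a) fix an arbitrary pure memoryless strategy.
    sigma = {s: choices[0] for s, choices in succ.items()}
    while True:
        # (b) compute the values under the fixed strategy.
        chain = {s: {t: 1.0} for s, t in sigma.items()}
        chain.update(delta)
        val = evaluate(chain, upper, reward, beta)
        # (c) switch to a strictly better successor where one exists.
        changed = False
        for s, choices in succ.items():
            best = max(choices, key=lambda t: val[t])
            if val[best] > val[sigma[s]] + eps:
                sigma[s] = best
                changed = True
        if not changed:
            return sigma, val


# Hypothetical MDP: a is an upper level player-1 state; b (upper) and
# l (lower) are probabilistic.
sigma, val = improve(
    succ={"a": ["l", "b"]},
    delta={"b": {"a": 1.0}, "l": {"a": 0.5, "l": 0.5}},
    upper={"a", "b"},
    reward={"a": 1.0, "b": 3.0},
    beta=0.5,
)
```

On the example the initial strategy moves from `a` to `l` and yields value 1 at `a`; one improvement step switches to `b`, after which no switch helps and the value at `a` is 5/3. The threshold `eps` guards against switching on floating-point noise; an exact evaluation would not need it.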

4 Conclusion 

We have introduced a new model of stochastic games that provides a uniform framework for decision
making across different time scales. We have shown that pure memoryless optimal strategies exist in
these games. Our framework subsumes classical discounted games, and provides a natural extension
in which discounting is applied at different time granularities. We show that in our framework the
solution for MDPs can be achieved in polynomial time, matching the best known bound for MDPs with
discounted objectives. For two-level turn-based stochastic games we show that whether the value is equal
to a rational can be decided in NP $\cap$ coNP, matching the best known complexity bound for discounted
stochastic games.